Abstract

Due to the rapid growth of electronically accessible content on the Internet, there is a corresponding
increase in demand for information of all types from a diverse range of users. Although the World-Wide Web
presents tremendous opportunities to users for access to this wealth of information, the quantity of that
information can be overwhelming. The user who attempts to find information can become confounded by the
sheer volume of data and information returned as “pertinent” to his/her need. In addition, current awareness
becomes an obstacle, as variations in search engine crawls of the Web, as well as the user’s own ability to
keep up with frequent queries to multiple search tools, can prevent timely access to and knowledge of pertinent
information. This session will focus on the various Internet search engines and directories, and on how to
improve the user experience through the use of such techniques as metadata, meta-search engines,
subject-specific search tools, and other developing technologies.


1.0.  Background 

Ever since the World-Wide Web’s beginnings in the 1990s, the amount of information available on it has
steadily increased. It is estimated that over 1 billion web pages exist on the World-Wide Web today. As
expected, this number is continuing to grow, although at a much slower and, some say, more controlled rate.
The growth of World-Wide Web content has also caused the community of casual and advanced users to
consider alternative means of finding information.

As the information content on the World-Wide Web has grown, so too has the need for improved tools and
products to aid users in the discovery of information. Several tools perform essentially the same function, but
may differ slightly in their methods and results. This primarily has to do with vendor-specific handling of
matters such as spam, spider/crawler configurations, and collection size. All of this leads to industry
estimates that less than 20% of the entire content of the World-Wide Web is available to the typical user
(World-Wide Web Consortium 2002). This paper explains the relevant terminology and provides simple
techniques that users can apply to improve their search experiences on the World-Wide Web.


Paper presented at the RTO IMC Lecture Series on “Electronic Information Management for PfP Nations”,
held in Vilnius, Lithuania, 24-26 September 2002, and published in RTO-EN-026.
2.0.  Basic Terminology

2.1.  What  do  Internet  search  engines  really  see? 

From the user’s perspective, as shown in Figure 1, searching is simple: the user enters a term in a search box
and waits for results, oblivious to what the computer or system is doing. This is the way it should be.
If users have to worry about how an Internet search engine is configured or what it expects, then most likely
the search engine’s user interface needs to be redesigned or another product selected. Users have too many
other things to do, whether at work or at home, to concern themselves with learning the idiosyncrasies of
each Internet search engine.


Figure 1.  Typical User Search (end-user perspective)


However, what the user often does not realize is that Internet search engines primarily read the underlying
document codes, or “metatags”, within a web document. Metatags are document tags or properties that are
usually stored within the Header of an HTML document. Figure 2 below shows a typical view of what an
Internet search engine would see.


[Screenshot of a Notepad window titled “NBII BioBot” showing the page source; long lines are cut off at the
right edge of the window]

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0039)http://pkp.profusion.com/partners/NBII/ -->
<HTML><HEAD><TITLE>NBII BioBot</TITLE>
<META content=text/html;CHARSET=iso-8859-1 http-equiv=Content-Type>
<META content="MSHTML 5.00.2722.2800" name=GENERATOR></HEAD>
<BODY aLink=#ff0000 background="NBII BioBot_files/background.gif
bgColor=#ffffff link=#000099 vLink=#B06633>
<P><IMG align=bottom alt="BioBot Banner" border=0 height=72
src="NBII BioBot_files/biobotbanner.jpg" width=580></A><BR>
<P>
<TABLE border=0 cellPadding=0 cellSpacing=0 width="100%">
<TBODY>
<TR>
<TD vAlign=top><!-- Read in Biodiversity Navagation Here --><BR><IMG
align=top alt="" border=0 height=1 src="NBII BioBot_files/pixel.gi
width=3> <A href="http://www.nbii.gov/partner/"><IMG align=bottom
alt="NBII Partnership" border=0 height=121 src="NBII BioBot_files
width=72></A>
<P></P>
<P><A href="http://www.nbii.gov/index.html" title="NBII Home"><IM
align=bottom alt="NBII Home Page" border=0 height=25
src="NBII BioBot_files/3buttons.gif" width=80></A> <BR><A
Figure 2.  Typical Internet document as viewed by search engines
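
To make the search engine's view concrete, the following is a minimal sketch of how a crawler might read the Header of a page such as the one in Figure 2. It uses only the Python standard library. The class name HeadSummary, the function summarize, and the shortened SAMPLE_PAGE are illustrative assumptions, not part of any particular search engine product.

```python
from html.parser import HTMLParser


class HeadSummary(HTMLParser):
    """Collects the <title> text and the <meta> name/content pairs of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Most metatags use name=...; a few use http-equiv=... instead.
            key = attrs.get("name") or attrs.get("http-equiv")
            if key:
                self.meta[key.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html_text):
    """Return (title, metatags) as a crawler reading the Header might."""
    parser = HeadSummary()
    parser.feed(html_text)
    return parser.title.strip(), parser.meta


# Shortened, hypothetical version of the page source shown in Figure 2.
SAMPLE_PAGE = """<HTML><HEAD><TITLE>NBII BioBot</TITLE>
<META content="text/html; charset=iso-8859-1" http-equiv="Content-Type">
<META content="MSHTML 5.00.2722.2800" name="GENERATOR"></HEAD>
<BODY><P>Welcome</P></BODY></HTML>"""

title, meta = summarize(SAMPLE_PAGE)
print(title)
print(meta)
```

Note that this sample page carries no keywords or description metatag at all, so a crawler working mainly from the Header would have little beyond the title to index.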


2.2.  What  is  Spam? 

“Spam” is a term you often hear thrown about on the World-Wide Web today. Spam is not just a popular
Hawaiian luncheon meat anymore. Understanding what spam is, and is not, is very important in understanding
how search engines on the WWW discover and display information to users. Spam is considered to be
anything that a software developer or HTML creator does to misrepresent his or her content to a search engine.
In today’s web environment, content creators jockey for position on Internet search engines’ results/hit lists
and often resort to categorizing their sites in ways that may not truly represent the content or overall purpose.
This is considered spamming a search engine crawler or data harvester. Tricks commonly employed by web
content creators include placing keywords in the Header section of an HTML document that have
nothing to do with their site, or creating seemingly blank HTML pages with white text so that users don’t see
the content, but a search engine can. Internet search engines are all wise to these tricks and adjust their
ranking rules accordingly, which is why it is often difficult for content producers and/or developers who have
truthful content, and who are trying to do a good job of making it available, to understand what an Internet
search engine expects and gives preference to.
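
As an illustration of the first trick, the sketch below measures how many of a page's declared keywords never appear in its body text. This is a hypothetical heuristic written for this discussion, not the method of any actual search engine; the names PageText and unsupported_keyword_ratio and both sample pages are assumptions. It does not attempt to detect white-on-white text.

```python
import re
from html.parser import HTMLParser


def words(text):
    """Lower-cased alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PageText(HTMLParser):
    """Separates the keywords metatag from the words found in the body."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.body_words = set()
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self._in_body = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.body_words.update(words(data))


def unsupported_keyword_ratio(html_text):
    """Fraction of declared keywords with no word present in the body text."""
    page = PageText()
    page.feed(html_text)
    if not page.keywords:
        return 0.0
    missing = [k for k in page.keywords
               if not any(w in page.body_words for w in words(k))]
    return len(missing) / len(page.keywords)


HONEST_PAGE = ('<html><head><meta name="keywords" content="loon, bird, migration">'
               '</head><body><p>The common loon is a bird known for its long '
               'migration.</p></body></html>')
SPAM_PAGE = ('<html><head><meta name="keywords" '
             'content="lottery, celebrity, free music, loon"></head>'
             '<body><p>Loon photographs for sale.</p></body></html>')

print(unsupported_keyword_ratio(HONEST_PAGE))
print(unsupported_keyword_ratio(SPAM_PAGE))
```

A high ratio only suggests that the keywords misrepresent the page. A truthful page that relies on synonyms in its keywords would also score above zero, which is one reason honest content producers can find search engine behaviour hard to predict.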


2.3.  The basic Internet search engine model

Internet  search  engines  on  the  WWW  “harvest”  data  from  publicly  available  web  sites.  This  harvesting  or 
gathering  of  summary  information  (usually  items  such  as  URL,  keywords,  X number  of  characters  from  the 
full-text  of  the  site)  to  a central  point  is  done  with  spiders  and/or  crawlers.  Spiders  and  crawlers  are  simply 
automated  jobs  or  processes  that  run  from  an  Internet  search  engine  provider’s  server  and  scour  the  WWW  for 
content.  This  content  is  then  made  available  through  the  Internet  search  engine  providers’  central  index. 
Figure  3 below  demonstrates  this  process. 
Figure 3.  Basic Internet search engine harvesting model
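
The harvesting model can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: FAKE_WEB is a hypothetical in-memory stand-in for publicly available web sites (so the sketch runs without network access), and the names PageParser and harvest are invented for this example. The summary stored per page follows the items mentioned above: URL, keywords, and the first X characters of the text.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical stand-in for publicly available web sites: URL -> HTML source.
FAKE_WEB = {
    "http://example.org/": (
        '<html><head><title>Home</title>'
        '<meta name="keywords" content="biology, birds"></head>'
        '<body>Welcome to the bird pages. '
        '<a href="http://example.org/loon">Loon</a> '
        '<a href="http://example.org/missing">Broken link</a></body></html>'
    ),
    "http://example.org/loon": (
        '<html><head><title>Common Loon</title>'
        '<meta name="keywords" content="loon, Gavia immer"></head>'
        '<body>The common loon breeds on northern lakes. '
        '<a href="http://example.org/">Home</a></body></html>'
    ),
}


class PageParser(HTMLParser):
    """Pulls out the keywords metatag, the body text, and outgoing links."""

    def __init__(self):
        super().__init__()
        self.keywords = ""
        self.links = []
        self.text = []
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self._in_body = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keywords = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.text.append(data)


def harvest(start_url, fetch, max_chars=40):
    """Crawl outward from start_url and return a central index of summaries."""
    index, seen, queue = {}, {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        source = fetch(url)
        if source is None:  # unreachable page: nothing to index
            continue
        page = PageParser()
        page.feed(source)
        full_text = " ".join(" ".join(page.text).split())
        index[url] = {"keywords": page.keywords, "excerpt": full_text[:max_chars]}
        for link in page.links:
            if link not in seen:  # visit each URL at most once
                seen.add(link)
                queue.append(link)
    return index


central_index = harvest("http://example.org/", FAKE_WEB.get)
for url, summary in central_index.items():
    print(url, summary)
```

A production crawler would fetch pages over HTTP, honour exclusion rules, and revisit pages on a schedule; the sketch only shows the flow from individual sites to a central index.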


2.4.  What  are  Metatags  and  why  are  they  important? 

Embedding metatags within the HTML of your Web site not only promotes higher ranking, and thus
better retrieval, of your site by many of the major search engines, but also provides a foundation for future
information retrieval and discovery on the Web. The algorithms used by search engines constantly change;
however, the presence of metatags on your pages can often make a dramatic difference in enabling users to
find your information. Remember, too, that as more sites make use of metatags, an integrated system
emerges in which users who can easily locate your site through a search engine are also likely to explore
other related sites within the WWW.

The tables below describe both standard metatags and metatags unique to a discipline (in this example,
biological information), all of which can be implemented on Web sites. Some tags are required by search
engines, while others are optional, depending upon the scope and context of the page(s) under development.
Additional metatag requirements may be added as retrieval tools become more sophisticated. Fortunately, the
creation and editing of metatags is a quick and simple process, thanks to the development of metatag software
that can rapidly generate tags selected by a content provider across designated pages, directories, or an
entire site.

The metatags in Table 1 below are all standard tags supported in HTML 3.0 or above. For dynamically
created web sites, the metatags described below can simply be generated automatically from a
database dump or export.
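
Generating metatags from exported data could look like the following minimal sketch. The function name render_metatags and the layout of the record (one dictionary per page, as might come from a database export) are assumptions made for illustration.

```python
from html import escape


def render_metatags(record):
    """Turn one exported database record into a block of <meta> tags."""
    lines = []
    for name, value in record.items():
        if value is None or value == "":
            continue  # skip fields that were empty in the export
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)  # e.g. a list of keywords
        lines.append('<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(str(value), quote=True)))
    return "\n".join(lines)


# Hypothetical row from a database export describing a single page.
row = {
    "author": "Bob Johnson",
    "keywords": ["loon", "birds", "NBII"],
    "language": "en-us",
    "classification": "",
}
print(render_metatags(row))
```

Escaping the values matters because descriptions pulled from a database may contain quotation marks or ampersands that would otherwise break the tag.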
Table 1.  Standard Metatags

Metatag: Author
Definition: The Author tag contains the name of the content provider (not the Webmaster / programmer).
Format & Sample Value: <meta name="author" content="Bob Johnson">

Metatag: Title
Definition: Even though the Title tag is not considered a true metatag, it is critical in search engines’
ranking algorithms, and provides users with general information about your page. Search engines’
results/hit lists also display the Title tag. Up to 80 characters can be contained within this tag.
Format & Sample Value: <TITLE>West Nile Virus:

Metatag: Keywords
Definition: Keywords are probably the most important metatag that a Web site manager can include. Up to
1000 characters can be contained within this tag. Your keyword contents should include the standard
keywords shown in the sample value, plus all terms relevant to your site and particular sub-sections.
Include several generic terms that apply to your entire node, plus terms specific to various sub-directories
and pages. Try to think of as many synonyms for your terms as you can. Note that you need to include term
variations (e.g. bird, birds, birding, birdwatcher), as the search engines do not employ stemming when
parsing keywords. Spelling counts! Use terms found within the page contents to boost relevancy rankings.
Format & Sample Value: <meta name="keywords" content="your page-specific keywords...., NBII, National
Biological Information Infrastructure, biology, biodiversity, natural resources, reference, education">
(place these standard keywords AFTER your page-specific keywords)

Metatag: Page Description
Definition: The Description tag is used by search engines to display information about your page and to
index its contents. Up to 200 characters can be contained within this tag. The description often determines
whether the searcher will choose to view your page. Make the description relevant to the particular
sub-section or page; don’t rely on one generic description for all pages on your site. Use keyword tag terms
in your description to boost term relevancy rankings.
Format & Sample Value: <meta name="description" content="This is the textual description for your page.
Please make sure your spelling is correct and include any relevant keywords within the Description tag.">

Metatag: Language
Definition: Even though most content on the web is in English, the Language tag adds value to your Web
site, helping users limit search engine retrieval to a particular language.
Format & Sample Value: <meta name="language" content="en-us">

Metatag: Classification
Definition: The Classification tag is often used by a number of the Web search engines when you register
your site and/or when your site is indexed, so that your site can be classified with other similar sites.
Typical values include: “Government, Science, Education, etc.”.
Format & Sample Value: <meta name="classification" content="Government, Science">

Metatag: Ratings/PICS
Definition: The Ratings and PICS tags are used by Internet providers and search engines to limit access to
a particular page. Often this is used to keep children using the Internet away from “Mature Audience Only”
pages. Typical values include: “General, Restricted, Mature, Safe for Kids”, etc. Because filters are
becoming more common within retrieval tools and browsers, or as added software, these tools may
arbitrarily block your site if the tag is not implemented.
Format & Sample Value: <meta name="rating" content="General, Safe for Kids">

Table 2 below describes the unique or custom metatags for a domain-specific organization. In this case, these
custom metatags are relevant to categorizing, displaying, and delivering biological data and information.


Table 2.  Domain Specific Metatags (Custom tags)

Metatag: Species Scientific Name
Definition: The Scientific Name of a particular Species on the Web page being classified. NBII Partners are
strongly encouraged to utilize the Integrated Taxonomic Information System (ITIS)
(http://www.itis.usda.gov/plantproj/itis/index.html) as the basis for completing this information.
Format & Sample Value: <meta name="Species Scientific Name" content="Parnassius smintheus">

Metatag: Species Common Name
Definition: The Common Name of a particular Species on the Web page being classified. The Common Name
is extremely important to both expert and novice users for finding information about a particular species.
ITIS is a source for completing this metatag.
Format & Sample Value: <meta name="Species Common Name" content="Rocky Mountain Parnassian">

Metatag: Organization
Definition: The lead Partner organization that maintains the specific Web site/page being classified. The
use of standard controlled lists is strongly encouraged for completing this field.
Format & Sample Value: <meta name="Organization" content="USGS Center for Biological Informatics">

Metatag: Web-site Theme
Definition: The high-level Theme (Education, etc.) that your Web page falls under within a web structure.
Format & Sample Value: <meta name="website Theme" content="Education">

Metatag: Web-site Category
Definition: The specific Category, within the website Theme, that your Web page falls under.
Format & Sample Value: <meta name="website Category" content="General Curriculum">


Domain-specific metatags greatly aid a particular community of users in the discovery and identification of
quality resources. For example, if a user accesses one of the search engines on the World-Wide Web today
and searches for a specific bird, e.g. “common loon”, the search produces a hit list of more than 13
million results. Some of these results are most likely pertinent to the user, but most are not, and it is
infeasible for a user to navigate through 13 million web pages for relevant data.

To resolve this issue, programs such as the National Biological Information Infrastructure
(http://www.nbii.gov) have been implementing a refined and improved spidering methodology with their
partners and applying metatags within their local and partner pages. As a result, users can now easily narrow
their results lists to 62,000 web pages with the same search that yielded over 13 million results. These
spidered and indexed pages are primarily biological in nature and, owing to the intellectual effort currently
ongoing within the NBII Program to add information content to the NBII System, users can expect to
receive more targeted and higher quality results than they would by directly accessing the WWW and its
search engines. Users also have the ability to narrow their search results to 1,400 web pages and information
sources by directly querying the meta-information contained within a domain-specific or custom metatag
called “Common Name”. As one can imagine, this saves users tremendous time and presents authoritative and
related information without requiring an already information-overloaded user to review a large number of
primarily non-pertinent results.
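
The difference between searching the full text and querying a custom metatag can be illustrated with a toy index. The records, URLs, and function names below are hypothetical and exist only to show the narrowing effect; they do not reproduce the NBII System or its result counts.

```python
# Hypothetical index records: page text plus any custom metatags found.
RECORDS = [
    {"url": "http://example.org/loon-biology",
     "text": "Nesting habits of the common loon on northern lakes",
     "meta": {"Species Common Name": "Common Loon"}},
    {"url": "http://example.org/lake-cabins",
     "text": "Cabin rentals; hear the common loon call at dusk",
     "meta": {}},
    {"url": "http://example.org/loon-lake-realty",
     "text": "Common Loon Lake real estate listings",
     "meta": {}},
    {"url": "http://example.org/parnassian",
     "text": "Rocky Mountain Parnassian butterfly survey",
     "meta": {"Species Common Name": "Rocky Mountain Parnassian"}},
]


def full_text_search(records, phrase):
    """Every page whose text mentions the phrase, relevant or not."""
    phrase = phrase.lower()
    return [r["url"] for r in records if phrase in r["text"].lower()]


def metatag_search(records, tag, value):
    """Only pages whose custom metatag was deliberately set to the value."""
    value = value.lower()
    return [r["url"] for r in records if r["meta"].get(tag, "").lower() == value]


print(full_text_search(RECORDS, "common loon"))
print(metatag_search(RECORDS, "Species Common Name", "common loon"))
```

The full-text query returns the cabin rental and real estate pages along with the biology page, while the metatag query returns only the page that a content provider explicitly classified as being about the species.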
3.0.  Typical search engine features and capabilities

As stated, most search engines perform the same basic functions but often differ in their implementation and
configuration. Below are some of the features you would find in a typical search engine. Low-end
search engines may not have all of the features noted, or may limit how many documents can be indexed
or held within a collection.

• Contains  an  automated  spider  or  crawler 

• No  theoretical  limits  in  the  amount  of  indexing  (limited  by  hardware) 

• Support  remote  indexing 

• Continual  background  indexing  of  content 

• Custom  metatag  support  (some  low-end  products  do  not  support  this  feature) 

• Support for indexing PDF, .doc, etc. (some low-end products do not support this feature)

• Supports  URL  and  word  exclusions  & inclusions 

• SSI  supported 

• Search  by  custom  metatags 

• Case  sensitive  or  insensitive  searching 

• Simple 

• Customizable  search/results  pages 

• Boolean  Searching  capabilities 

• Provide  users  meta  description  and  page  title  in  search  results 

• Inexpensive  - $200 

• Easily  customizable  search/results  interface 

• Result  weighting  feature 

• URL  Inclusion  list 

• Require  significant  memory  (RAM)  and  disk  space  as  the  collection  grows 

• Low-end  alternatives  often  do  not  possess  the  capabilities  to  do  phrase  or  natural  language 
searching. 
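
Two of the features listed above, Boolean searching and case-insensitive searching, can be sketched with a small inverted index. This is an illustrative sketch only; the function names, the parameter names all_of, any_of, and none_of, and the sample documents are assumptions and do not correspond to any specific product.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def boolean_search(index, all_of=(), any_of=(), none_of=()):
    """AND the all_of terms, OR the any_of terms, then drop none_of matches."""
    def lookup(term):
        return index.get(term.lower(), set())  # case-insensitive lookup

    hits = None
    for term in all_of:  # AND
        hits = lookup(term) if hits is None else hits & lookup(term)
    if any_of:  # OR
        either = set().union(*(lookup(t) for t in any_of))
        hits = either if hits is None else hits & either
    if hits is None:  # no positive terms were given
        return set()
    for term in none_of:  # NOT
        hits = hits - lookup(term)
    return set(hits)


# Hypothetical collection of four short documents.
DOCS = {
    1: "Common loon nesting survey",
    2: "Loon Lake real estate",
    3: "Bird migration: loon and heron counts",
    4: "Heron rookery report",
}
INDEX = build_index(DOCS)

print(boolean_search(INDEX, all_of=["loon"], none_of=["estate"]))
print(boolean_search(INDEX, all_of=["bird"], any_of=["loon", "heron"]))
```

Phrase and natural language searching need more than this word-level index, for example word positions, which is consistent with the observation that low-end products often lack them.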

4.0.  What can you do as a content developer or software developer to improve discovery of your
content?

Content producers can do several things to help ensure that their information content is more readily found
on the WWW today. Some of these things make perfect sense, but producers often do not dedicate the
resources required to make them happen on a regular basis. Each environment and web site is different;
however, the general principles and techniques noted below will help any web content producer.

• Implement metatags on your own and your partners’ web sites

• Update  content  frequently 

• Register  your  site  with  the  major  search  engines  (tools  exist  to  aid  in  this  process) 

• Perform a basic study of where your site ranks in the results of the major search engine providers

• Do  not  spam  the  search  engine  providers 

• Re-evaluate  your  web  site  directory  structure  to  ensure  information  is  appropriately 
categorized/described  within  your  URL  strings 

• Look  through  your  server  log  files  to  determine  what  users  are  trying  to  find  on  your  site 
and/or  the  path  they  are  using  to  find  information 

• Perform  basic  usability  testing  of  your  site  to  determine  what  users  expect  and  can  easily 
gather  from  your  site.  This  also  may  determine  why  users  go  to  an  Internet  search  engine 
provider  versus  accessing  your  site  directly. 
• Realize that Internet search engines do not all act the same or index on the same schedule, and
that one vendor’s product often values a particular metatag, document date, etc. more than another’s.
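
The suggestion above to look through server log files can be sketched as follows. The example assumes the server writes the widely used Combined Log Format, which records the referring URL, and that the referring search engine carries the query in a parameter named q, query, or p. Both are assumptions to be checked against your own server and the search engines your visitors use; the sample log lines are invented.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Request, status, size, then the quoted referrer (Combined Log Format assumed).
LOG_LINE = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"')


def search_terms(log_lines, query_params=("q", "query", "p")):
    """Count the search phrases that led visitors to the site."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the expected format
        params = parse_qs(urlparse(match.group("referrer")).query)
        for name in query_params:
            for value in params.get(name, []):
                counts[value.strip().lower()] += 1
    return counts


# Invented sample entries for illustration.
SAMPLE_LOG = [
    '192.0.2.1 - - [24/Sep/2002:10:00:00 +0000] "GET /birds/loon.html HTTP/1.0" '
    '200 5120 "http://search.example.com/search?q=common+loon" "Mozilla/4.0"',
    '192.0.2.2 - - [24/Sep/2002:10:05:00 +0000] "GET /birds/loon.html HTTP/1.0" '
    '200 5120 "http://find.example.net/results?query=Common+Loon" "Mozilla/4.0"',
    '192.0.2.3 - - [24/Sep/2002:10:09:00 +0000] "GET /index.html HTTP/1.0" '
    '200 2048 "-" "Mozilla/4.0"',
    '192.0.2.4 - - [24/Sep/2002:10:12:00 +0000] "GET /wnv.html HTTP/1.0" '
    '200 4096 "http://search.example.com/search?p=west+nile+virus" "Mozilla/4.0"',
]

for phrase, count in search_terms(SAMPLE_LOG).most_common():
    print(count, phrase)
```

Phrases that appear often but lead to pages with poor descriptions or keywords are good candidates for metatag improvements.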

5.0.  Conclusion 

As one can see, maintaining awareness and improving delivery of your information via the WWW in today’s
environment is almost a full-time job. As Internet search engine providers become more sophisticated, it
will be necessary for content producers and providers to restructure their information to take advantage of
such capabilities. With the advent of new technologies, such as XML and SOAP, information content will be
more readily delivered at a more granular scale and to a more targeted audience. However, these
technologies are still in their infancy when it comes to overall web content, and Internet search engines will
continue to be one of the major means by which users gather information.